Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Response Ops][Alerting] Updating AlertsClient to provide feature parity with rule registry lifecycle executor #160466

Merged
merged 6 commits into from
Jun 29, 2023

Conversation

ymao1
Copy link
Contributor

@ymao1 ymao1 commented Jun 23, 2023

Resolves #160173

Summary

The rule registry lifecycle executor automatically sets the following fields in alert docs:

  • event.action - open, active or close depending on what type of alert
  • event.kind - always signal
  • tags - merges rule tags with rule executor reported tags
  • kibana.version
  • kibana.alert.workflow_status - set to open
  • kibana.alert.time_range

In addition, the rule registry lifecycle executor provides some helper functions for the rule executors to get the alert UUID, the alert start time (if it exists) and the alert document for recovered alerts (used to set recovered context variables).

This PR augments the framework AlertsClient to set the same fields and to provide the same functionality to the rule executors. When an alert is reported via the AlertsClient, the UUID (either existing or newly generated) and the start time (for ongoing alerts) is returned back to the rule executors. When an executor requests the recovered alerts in order to set context information, the existing alert document is returned.

To Verify

Check out this commit which removes the metric threshold rule from the rule registry lifecycle executor and onboards it to use the framework alerts client. Create a metric threshold rule that creates active alerts and recovers alerts. Inspect the alerts documents to make sure all the fields mentioned above exist. Compare these documents with alerts created using the lifecycle executor.

@ymao1 ymao1 force-pushed the alerting/faad-lifecycle branch from cd2a9d9 to 812e0e1 Compare June 26, 2023 14:34
…ting alerts client to return alert info to rule executors
@ymao1 ymao1 force-pushed the alerting/faad-lifecycle branch from 4f30554 to c82a20d Compare June 27, 2023 16:27
@ymao1 ymao1 changed the title Alerting/faad lifecycle [Response Ops][Alerting] Updating AlertsClient to provide feature parity with rule registry lifecycle executor Jun 27, 2023
@ymao1 ymao1 self-assigned this Jun 28, 2023
@ymao1 ymao1 added Feature:Alerting Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) Feature:Alerting/Alerts-as-Data Issues related to Alerts-as-data and RuleRegistry v8.10.0 labels Jun 28, 2023
@ymao1 ymao1 marked this pull request as ready for review June 28, 2023 16:21
@ymao1 ymao1 requested review from a team as code owners June 28, 2023 16:21
@elasticmachine
Copy link
Contributor

Pinging @elastic/response-ops (Team:ResponseOps)

@ymao1 ymao1 added the release_note:skip Skip the PR/issue when compiling release notes label Jun 28, 2023
Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only reviewed the code, not the tests, nor done a live test - which I will :-)

However, thought I'd send comments from the code. Specifically interested in the state.start thing issue / PR ...

@@ -76,6 +77,10 @@ export class Alert<
return this.meta.uuid!;
}

getStart(): string | null {
return this.state.start ? (this.state.start as string) : null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels like ${this.state.start} would be safer than this.state.start as string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 425c34f

...(legacyAlert.getState().duration
? { duration: { us: legacyAlert.getState().duration } }
: {}),
...(legacyAlert.getState().start ? { start: legacyAlert.getState().start } : {}),
...(legacyAlert.getState().start
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm ... I totally forgot about our issue with start, end and duration in state: #144929 ; the PR to fix this never got merged, I'd guess it's in a bad state by now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we never merged that PR :(

@@ -96,17 +96,22 @@ export interface TrackedAlerts<
recovered: Record<string, LegacyAlert<State, Context>>;
}

// allows Partial on nested objects
export type RecursivePartial<T> = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is still needed, but for some reason I extended this pattern a bit in event log to handle arrays:

type DeepPartial<T> = {
[P in keyof T]?: T[P] extends Array<infer U> ? Array<DeepPartial<U>> : DeepPartial<T[P]>;
};

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out they're everywhere! There's a DeepPartial in @kbn/utility-types that I ended up using. Updated in 425c34f

@pmuellr pmuellr self-requested a review June 28, 2023 19:43
@ymao1
Copy link
Contributor Author

ymao1 commented Jun 29, 2023

@elasticmachine merge upstream

@kibana-ci
Copy link
Collaborator

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] Investigations - Security Solution Tests #2 / Detections : Page Filters Alert list is updated when the alerts are updated
  • [job] [logs] Explore - Security Solution Tests #2 / Entity Analytics Dashboard With anomalies data enables a job

Metrics [docs]

Unknown metric groups

ESLint disabled line counts

id before after diff
enterpriseSearch 14 16 +2
securitySolution 413 417 +4
total +6

Total ESLint disabled count

id before after diff
enterpriseSearch 15 17 +2
securitySolution 492 496 +4
total +6

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

cc @ymao1

@pmuellr
Copy link
Member

pmuellr commented Jun 29, 2023

I'm seeing a difference between the "old" (without the commit that has mt use framework) and "new" (with the commit that has mt use framework).

In "old", the JSON doc is how I would expect:

 "_source": {
    "kibana": {
      "alert": {
        "evaluation": {
          "values": [
            10.8
          ]
        },
        "reason": "kibana.alert.rule.execution.metrics.rule_type_run_duration_ms is 10.8 in the last 1 min. Alert when < 100.",
        "action_group": "recovered",
        "flapping": false,
       ... // rest here
    }
  }
}

in "new", it's flattened:

  "_source": {
    "kibana.alert.uuid": "a15920f4-4f9c-48fd-8cde-8a6439f425b7",
    "kibana.alert.status": "recovered",
    "kibana.alert.workflow_status": "open",
    "event.kind": "signal",
    "event.action": "close",
    "kibana.version": "8.10.0",
    "kibana.alert.flapping": true,
  }

Given conversation yesterday, I believe the _source is something customers shouldn't be relying on. So, perhaps this is fine. The fields structure in the alert docs looks fine.

@ymao1
Copy link
Contributor Author

ymao1 commented Jun 29, 2023

I'm seeing a difference between the "old" (without the commit that has mt use framework) and "new" (with the commit that has mt use framework).

The rule registry alert doc is actually the flattened version while the alerting framework generated doc is unflattened. Mike and I discussed this difference and decided it was ok since it shouldn't break any queries and it shows up in the alerts table ok (and as you said, using _source is not recommended)

Copy link
Member

@pmuellr pmuellr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Tested it, and it appears the new alert docs have flattened _source, but the old docs have object structures. Noted that in a separate comment though, and seems like it's probably fine.

@@ -1750,12 +1795,18 @@ describe('Alerts Client', () => {
tags: ['rule-', '-tags'],
uuid: '1',
},
start: '2023-03-28T12:27:28.159Z',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious why you changed the date fromr 11 to 12! Everything looks fine, just wondering :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know! I think I got confused and thought that 12:27 was the hard-coded value I have Date.now() resolve to (which is actually 22:27) so I wanted to make sure the start time was different. Too many numbers!

@ymao1 ymao1 merged commit 50049ac into elastic:main Jun 29, 2023
@kibanamachine kibanamachine added the backport:skip This commit does not require backporting label Jun 29, 2023
@ymao1 ymao1 deleted the alerting/faad-lifecycle branch June 29, 2023 21:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting Feature:Alerting/Alerts-as-Data Issues related to Alerts-as-data and RuleRegistry Feature:Alerting release_note:skip Skip the PR/issue when compiling release notes Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) v8.10.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Response Ops][Alerting] Move lifecycle rule executor logic into FAAD alerts API
6 participants